SMART in TREC 8
Authors
Abstract
This year was a light year for the Smart Information Retrieval Project at SabIR Research and Cornell. We officially participated in only the Ad-hoc Task and the Query Track. In the Ad-hoc Task, we made minor modifications to our document weighting schemes to emphasize high-precision searches on shorter queries. This proved only mildly successful; the top relevant document was retrieved higher, but the rest of the retrieval tended to be hurt. Our Query Track runs are described here, but the much more interesting analysis of these runs is described in the Query Track Overview.

Basic Indexing and Retrieval

In the Smart system, the vector-processing model of retrieval is used to transform both the available information requests and the stored documents into vectors of the form

$D_i = (w_{i1}, w_{i2}, \ldots, w_{it})$

where $D_i$ represents a document (or query) text and $w_{ik}$ is the weight of term $T_k$ in document $D_i$. A weight of zero is used for terms that are absent from a particular document, and positive weights characterize terms actually assigned. The assumption is that $t$ terms in all are available for the representation of the information.

The basic "tf*idf" weighting schemes used within SMART have been discussed many times. For TREC 8 we made a slight modification to the Lnu-ltu weights we have used in the past four years in TREC 4-7. We noticed that the pivoted byte-length document normalization used by Singhal et al. in TREC 7 [5] seems to favor high-precision searches when used with short queries. It is a bit more biased towards shorter documents than our previous "u" scheme, which uses the number of unique terms in the document; thus a good short document containing all the query terms will be ranked highly. We hoped that this would enable our blind feedback query expansion to be based on more relevant documents and thus improved.

The same phrase strategy (and phrases) used in all previous TRECs (for example [2, 3, 4, 1]) is used for TREC 8.
Any pair of adjacent non-stopwords is regarded as a potential phrase. The final list of phrases is composed of those pairs of words occurring in 25 or more documents of the initial TREC 1 document set. Phrases are weighted with the same scheme as single terms. Note that no human expertise in the subject matter is required for either the initial collection creation or the actual query formulation.

When the text of document $D_i$ is represented by a vector of the form $(d_{i1}, d_{i2}, \ldots, d_{it})$ and query $Q_j$ by the vector $(q_{j1}, q_{j2}, \ldots, q_{jt})$, a similarity ($S$) computation between the two items can conveniently be obtained as the inner product between corresponding weighted term vectors as follows:

$S(D_i, Q_j) = \sum_{k=1}^{t} d_{ik} \, q_{jk}$
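
The phrase selection and inner-product similarity described above can be illustrated with a minimal Python sketch. It is not the SMART implementation: the stopword list, the function names, the sparse dictionary representation of vectors, and the toy documents are all assumptions made for illustration, and adjacency is assumed to be judged on the original token sequence (before stopword removal). The term weights are taken as given; the Lnu-ltu weighting itself is not reproduced here.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set

# Hypothetical, tiny stopword list (assumption); a real list is far longer.
STOPWORDS: Set[str] = {"the", "a", "an", "of", "in", "and", "to", "is"}


def adjacent_pairs(tokens: List[str]) -> List[str]:
    """Return adjacent token pairs in which neither token is a stopword."""
    return [
        f"{left} {right}"
        for left, right in zip(tokens, tokens[1:])
        if left not in STOPWORDS and right not in STOPWORDS
    ]


def select_phrases(docs: Iterable[List[str]], min_docs: int = 25) -> Set[str]:
    """Keep candidate pairs that occur in at least `min_docs` documents."""
    doc_freq: Counter = Counter()
    for tokens in docs:
        # A set ensures each pair is counted once per document.
        doc_freq.update(set(adjacent_pairs(tokens)))
    return {pair for pair, df in doc_freq.items() if df >= min_docs}


def inner_product(doc: Dict[str, float], query: Dict[str, float]) -> float:
    """Sum d_ik * q_jk over terms with nonzero weight in both sparse vectors."""
    if len(query) > len(doc):
        doc, query = query, doc  # iterate over the shorter vector
    return sum(weight * doc[term] for term, weight in query.items() if term in doc)


# Toy collection (assumption); the threshold is lowered from 25 to 2 so that
# the example produces a phrase.
toy_docs = [
    ["information", "retrieval", "of", "text"],
    ["information", "retrieval", "system"],
    ["text", "retrieval"],
]
phrases = select_phrases(toy_docs, min_docs=2)

# Single terms and phrases live in the same vector, with made-up weights.
doc_vec = {"information": 0.5, "retrieval": 0.5, "information retrieval": 0.25}
query_vec = {"retrieval": 1.0, "information retrieval": 2.0}
score = inner_product(doc_vec, query_vec)
```

Because absent terms have weight zero, only terms shared by the document and the query contribute to the sum, which is why the sketch stores vectors as sparse dictionaries rather than as length-$t$ arrays.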
Similar Resources
CINDOR Conceptual Interlingua Document Retrieval: TREC-8 Evaluation
The TREC-8 evaluation of the CINDOR system was based on English and French data from the cross-language retrieval track. Our objective was to continue our investigation of our conceptual interlingua approach to cross-language retrieval, specifically by measuring the contribution of conceptual retrieval over and above a baseline cross-language retrieval approach based on machine translation of q...
CINDOR TREC-9 English-Chinese Evaluation
MNIS-TextWise Labs participated in the TREC-9 Chinese Cross-Language Information Retrieval track. The focus of our research for this participation has been on rapidly adding Chinese capabilities to CINDOR using tools for automatically generating a Chinese Conceptual Interlingua from existing lexical resources. For the TREC-9 evaluation we also built a version of our system which loosely integra...
New Retrieval Approaches Using SMART: TREC 4
The Smart information retrieval project emphasizes completely automatic approaches to the understanding and retrieval of large quantities of text. We continue our work in TREC 4, performing runs in the routing, ad-hoc, confused text, interactive, and foreign language environments.
Answering Live Questions from Heterogeneous Data Sources: SMART in Live QA at TREC 2016
A significant portion of information is today available in a digital format. However, users still face difficulties in accessing it. A big portion of the challenge consists in designing efficient approaches for reasoning over heterogeneous data sources. In this paper, we describe the participation of the Semantic Search and Question Answering group (SMART) in Live QA track at TREC 2016. SMART s...
Probabilistic retrieval based on document representations
Accessing information in multimedia databases encompasses a wide range of applications in which spoken document retrieval (SDR) plays an important role. In the recent past, research increasingly focused on the development of heuristic and probabilistic retrieval metrics that are suitable for retrieving spoken documents. So far, many heuristic retrieval metrics, e.g. the SMART-2 metric, have bee...